Generative Probabilistic Models for Retrieval of Documents with Structure and Annotations
Abstract
Introduction
Structure in documents has been a part of Information Retrieval for as long as the field has existed. Documents have always carried metadata such as authors and creation dates, and researchers have long recognized that document structure is important to effective retrieval. The earliest systems supporting structure focused on querying fielded information. This work has strong connections to research in library sciences, where fielded search of citations is important. A common approach to handling fielded queries was to treat constraints literally, perhaps providing a ranking of results by how well the multiple constraints are met.
A typical example of an early citation search system was the Norris Cotton Cancer Center (NCCC) On-Line Personal Bibliographic Retrieval System [10]. The bibliographic system stored citations of documents (articles, books, journals, etc.) in a fielded database. Citations were indexed with keywords in a controlled vocabulary or general language. Titles, authors, publication information, and call numbers were also indexed in fields. Searches could query the entire citation or only the controlled vocabulary. The searches were either conjunctions or disjunctions of clauses; more sophisticated search capabilities were not deemed necessary. Results were ordered by the author field.
A more sophisticated example of citation search is the SCAT-IR system [37]. SCAT-IR indexed fields similar to those of the NCCC bibliographic system, but allowed additional query structures. Each field could be queried in a clause, and query clauses could be combined with Boolean ANDs and ORs. Yet result sets were not ordered by any estimate of relevance.
Fox summarized early work with structured documents in [11]. Much work to that date was empirical, with no analysis of the conditions under which structure is informative. Most tasks limited the retrieval unit to documents, although some early work on passage retrieval had been performed. Fox described some experiments that found a vector space system performed better on sections than on passages, but offered no analysis of why. Fox further described how soft Boolean matching functions such as the P-Norm formalism can be extended to matching in complex documents. Up to this point there was little consideration of what the best unit of retrieval is and almost no consideration of the relationships between fields of information in a document.
The introduction of Inference Networks by Turtle and Croft [50] provided one of the first retrieval models explicitly designed to handle multiple representations of information …
Similar resources
Latent Dirichlet Markov Allocation for Sentiment Analysis
In recent years probabilistic topic models have gained tremendous attention in data mining and natural language processing research areas. In the field of information retrieval for text mining, a variety of probabilistic topic models have been used to analyse content of documents. A topic model is a generative model for documents; it specifies a probabilistic procedure by which documents can be...
Annotation-based Document Retrieval with Four-Valued Probabilistic Datalog
The COLLATE system (collaboratory for annotation, indexing and retrieval of digitized historical archive material) provides film researchers with a collaborative environment in which historic documents about European films can be analysed, interpreted and discussed, using nested annotations and discourse structure relations among them. Annotations are metadata, and annotation threads form a hyp...
Probabilistic Models for Expert Finding
A common task in many applications is to find persons who are knowledgeable about a given topic (i.e., expert finding). In this paper, we propose and develop a general probabilistic framework for studying the expert finding problem and derive two families of generative models (candidate generation models and topic generation models) from the framework. These models subsume most existing language mo...
Probabilistic Logical Information Retrieval for Content, Hypertext, and Database Querying
Classical retrieval models support content-oriented searching for documents using a set of words as data model. However, in hypertext and database applications we want to consider the link structure and attribute values of documents in addition to the pure content. In this paper, we present a framework based on probabilistic logical retrieval for describing the retrieval function for a query wh...
Vrije Competitie 2008 Exacte Wetenschappen
2.a) Summary. The project aims to develop new retrieval models and algorithms for searching and browsing in scientific literature. Two important recent developments in the scientific literature production process form the concrete motivation for this project: (1) semantically rich document structuring standards, and (2) increasingly rich keyword annotations that capture domain knowledge. The dr...
Probabilistic Models over Ordered Partitions with Applications in Document Ranking and Collaborative Filtering
Ranking is an important task for handling a large amount of content. Ideally, training data for supervised ranking would include a complete rank of documents (or other objects such as images or videos) for a particular query. However, this is only possible for small sets of documents. In practice, one often resorts to document rating, in that a subset of documents is assigned with a small numbe...